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Abstract 

Recognizing analogies, synonyms, anto- 
nyms, and associations appear to be four 
distinct tasks, requiring distinct NLP al- 
gorithms. In the past, the four tasks have 
been treated independently, using a wide 
variety of algorithms. These four seman- 
tic classes, however, are a tiny sample of 
the full range of semantic phenomena, and 
we cannot afford to create ad hoc algo- 
rithms for each semantic phenomenon; we 
need to seek a unified approach. We pro- 
pose to subsume a broad range of phenom- 
ena under analogies. To limit the scope of 
this paper, we restrict our attention to the 
subsumption of synonyms, antonyms, and 
associations. We introduce a supervised 
corpus-based machine learning algorithm 
for classifying analogous word pairs, and 
we show that it can solve multiple-choice 
SAT analogy questions, TOEFL synonym 
questions, ESL synonym-antonym ques- 
tions, and similar-associated-both ques- 
tions from cognitive psychology. 

1 Introduction 

A pair of words (petrify: stone) is analogous to an- 
other pair (vaporize:gas) when the semantic re- 
lations between the words in the first pair are 
highly similar to the relations in the second pair. 
Two words (levied and imposed) are synonymous 
in a context (levied a tax) when they can be in- 
terchanged (imposed a tax), they are are antony- 
mous when they have opposite meanings (black 
and white), and they are associated when they tend 
to co-occur (doctor and hospital). 



On the surface, it appears that these are four dis- 
tinct semantic classes, requiring distinct NLP al- 
gorithms, but we propose a uniform approach to 
all four. We subsume synonyms, antonyms, and 
associations under analogies. In essence, we say 
that X and Y are antonyms when the pair X:Y 
is analogous to the pair black: white, X and Y 
are synonyms when they are analogous to the pair 
levied:imposed, and X and Y are associated when 
they are analogous to the pair doctor:hospital. 

There is past work on recognizing 
analogies ( |Reitman, 1965| l, synonyms 
(Landau er and Dumais, 1997l l, antonyms 



(Lin et al., 2003 1 ), and associations ( |Lesk, 1969] ), 
but each of these four tasks has been examined 
separately, in isolation from the others. As far as 
we know, the algorithm proposed here is the first 
attempt to deal with all four tasks using a uniform 
approach. We believe that it is important to seek 
NLP algorithms that can handle a broad range 
of semantic phenomena, because developing a 
specialized algorithm for each phenomenon is a 
very inefficient research strategy. 

It might seem that a lexicon, such as Word- 
Net ( jFellbaum, 1998| ), contains all the information 
we need to handle these four tasks. However, we 
prefer to take a corpus-based approach to seman- 
tics. Veale (I2004I I used WordNet to answer 374 
multiple-choice SAT analogy questions, achieving 
an accuracy of 43%, but the best corpus-based ap- 
proach attains an accuracy of 56% ( [Turney, 2006| ). 
Another reason to prefer a corpus-based approach 
to a lexicon-based approach is that the former re- 
quires less human labour, and thus it is easier to 
extend to other languages. 

In Section ID we describe our algorithm for 



recognizing analogies. We use a standard su- 
pervised machine learning approach, with feature 
vectors based on the frequencies of patterns in a 
large corpus. We use a support vector machine 
(SVM) to learn how to classify the feature vectors 
dPlatt , 1998 ; Witten an d Frank, 1999| l. 

Section [3]presents four sets of experiments. We 
apply our algorithm for recognizing analogies to 
multiple-choice analogy questions from the SAT 
college entrance test, multiple-choice synonym 
questions from the TOEFL (test of English as a 
foreign language), ESL (English as a second lan- 
guage) practice questions for distinguishing syn- 
onyms and antonyms, and a set of word pairs that 
are labeled similar, associated, and both, devel- 
oped for experiments in cognitive psychology. 

We discuss the results of the experiments in Sec- 
tion m The accuracy of the algorithm is competi- 
tive with other systems, but the strength of the al- 
gorithm is that it is able to handle all four tasks, 
with no tuning of the learning parameters to the 
particular task. It performs well, although it is 
competing against specialized algorithms, devel- 
oped for single tasks. 

Related work is examined in Section [5] and lim- 
itations and future work are considered in Sec- 
tion [6l We conclude in Section |7] 

2 Classifying Analogous Word Pairs 

An analogy, A:B::C:D, asserts that A is to S as C 
is to D; for example, traffic:street::water:riverbed 
asserts that traffic is to street as water is to riverbed; 
that is, the semantic relations between traffic and 
street are highly similar to the semantic relations 
between water and riverbed. We may view the 
task of recognizing word analogies as a problem 
of classifying word pairs (see Table [T]). 



Word pair 



Class label 



carpenter:wood 
mason: stone 
potterxlay 
glassblower:glass 
traffic: street 
water:riverbed 
packets :network 
gossip:grapevine 



artisan:material 
artisan:material 
artisan:material 
artisan:material 
entity :carrier 
entity :carrier 
entity :carrier 
entity :carrier 



Table 1: Examples of how the task of recogniz- 
ing word analogies may be viewed as a problem of 
classifying word pairs. 

We approach this as a standard classification 



problem for supervised machine learning. The al- 
gorithm takes as input a training set of word pairs 
with class labels and a testing set of word pairs 
without labels. Each word pair is represented as a 
vector in a feature space and a supervised learning 
algorithm is used to classify the feature vectors. 
The elements in the feature vectors are based on 
the frequencies of automatically defined patterns 
in a large corpus. The output of the algorithm is an 
assignment of labels to the word pairs in the test- 
ing set. For some of the experiments, we select 
a unique label for each word pair; for other ex- 
periments, we assign probabilities to each possible 
label for each word pair. 

For a given word pair, such as mason: stone, the 
first step is to generate morphological variations, 
such as masons:stones. In the following experi- 
ments, we use morpha (morphological analyzer) 
and morphg (morphological generator) for mor- 
phological processing ( Minnen et al, 2001d {11 

The second step is to search in a large corpus for 
all phrases of the following form: 
"[0 to 1 words] X [0 to 3 words] y [0 to 1 words]" 
In this template, X:Y consists of morphological 
variations of the given word pair, in either or- 
der; for example, mason:stone, stone:mason, ma- 
sons:stones, and so on. A typical phrase for ma- 
son:stone would be "the mason cut the stone with". 
We then normalize all of the phrases that are found, 
by using morpha to remove suffixes. 

The template we use here is similar to Turney 
(2()06), but we have added extra context words 
before the X and after the Y. Our morpholog- 
ical processing also differs from Turney (12006 1) . 
In the following experiments, we search in a cor- 
pus of 5 X IQio words (about 280 GB of plain 
text), consisting of web pages gathered by a web 
crawlerJl To retrieve phrases from the corpus, we 
use Wumpus dBilttcher and Clarke, 2005| ), an effi- 
cient search engine for passage retrieval from large 
corporaJl 

The next step is to generate patterns from all 
of the phrases that were found for all of the in- 
put word pairs (from both the training and testing 
sets). To generate patterns from a phrase, we re- 
place the given word pairs with variables, X and 
Y, and we replace the remaining words with a wild 



'http://www.informatics.susx.ac.uk/research/groups/nlp/ 
carroll/morph.html. 

^The corpus was collected by Charles Clarke, University 
of Waterloo. We can provide copies on request. 

^http://www. wumpus-search.org/. 



card symbol (an asterisk) or leave them as they are. 
For example, the phrase "the mason cut the stone 
with" yields the patterns "the X cut * Y with", "* 
X * the Y *", and so on. If a phrase contains n 
words, then it yields 2^"^^^ patterns. 

Each pattern corresponds to a feature in the fea- 
ture vectors that we will generate. Since a typi- 
cal input set of word pairs yields millions of pat- 
terns, we need to use feature selection, to reduce 
the number of patterns to a manageable quantity. 
For each pattern, we count the number of input 
word pairs that generated the pattern. For example, 
"* X cut * Y *" is generated by both mason:stone 
and carpenter:wood. We then sort the patterns in 
descending order of the number of word pairs that 
generated them. If there are N input word pairs 
(and thus feature vectors, including both the 
training and testing sets), then we select the top 
kN patterns and drop the remainder. In the fol- 
lowing experiments, k is set to 20. The algorithm 
is not sensitive to the precise value of k. 

The reasoning behind the feature selection al- 
gorithm is that shared patterns make more useful 
features than rare patterns. The number of features 
(kN) depends on the number of word pairs (N), 
because, if we have more feature vectors, then we 
need more features to distinguish them. Turney 
(12006 ) also selects patterns based on the number 
of pairs that generate them, but the number of se- 
lected patterns is a constant (8000), independent of 
the number of input word pairs. 

The next step is to generate feature vectors, one 
vector for each input word pair. Each of the N 
feature vectors has kN elements, one element for 
each selected pattern. The value of an element in 
a vector is given by the logarithm of the frequency 
in the corpus of the corresponding pattern for the 
given word pair. For example, suppose the given 
pair is mason:stone and the pattern is "* X cut 
* Y *". We look at the normalized phrases that 
we collected for mason: stone and we count how 
many match this pattern. If / phrases match the 
pattern, then the value of this element in the fea- 
ture vector is log(/ + 1) (we add 1 because log(O) 
is undefined). Each feature vector is then normal- 
ized to unit length. The normalization ensures that 
features in vectors for high-frequency word pairs 
(traffic: street) are comparable to features in vectors 
for low-frequency word pairs (water:riverbed). 

Now that we have a feature vector for each in- 
put word pair, we can apply a standard supervised 



learning algorithm. In the following experiments, 
we use a sequential minimal optimization (SMO) 
support vector machine (SVM) with a radial ba- 
sis function (RBF) kernel dPlatt, 19981 ), as imple- 
mented in Weka (Waikato Environment for Knowl- 
edge Analysis) dWitten and Frank, 1999| )B The al- 
gorithm generates probability estimates for each 
class by fitting logistic regression models to the 
outputs of the SVM. We disable the normalization 
option in Weka, since the vectors are already nor- 
malized to unit length. We chose the SMO RBF 
algorithm because it is fast, robust, and it easily 
handles large numbers of features. 

For convenience, we will refer to the above algo- 
rithm as PairClass. In the following experiments, 
PairClass is applied to each of the four problems 
with no adjustments or tuning to the specific prob- 
lems. Some work is required to fit each problem 
into the general framework of PairClass (super- 
vised classification of word pairs) but the core al- 
gorithm is the same in each case. 

3 Experiments 

This section presents four sets of experiments, with 
analogies, synonyms, antonyms, and associations. 
We explain how each task is treated as a problem 
of classifying analogous word pairs, we give the 
experimental results, and we discuss past work on 
each of the four tasks. 

3.1 SAT Analogies 

In this section, we apply PairClass to the task 
of recognizing analogies. To evaluate the perfor- 
mance, we use a set of 374 multiple-choice ques- 
tions from the SAT college entrance exam. Table |2] 
shows a typical question. The tai^get pair is called 
the stem. The task is to select the choice pair that 
is most analogous to the stem pair. 



Stem: 




mason: stone 


Choices: 


(a) 


teacher:chalk 




(b) 


carpenter:wood 




(c) 


soldier:gun 




(d) 


photograph:camera 




(e) 


book: word 


Solution: 


(b) 


carpenter: wood 



Table 2: An example of a question from the 374 
SAT analogy questions. 

The problem of recognizing word analogies 



'*http://www.cs.waikato.ac.nz/ml/weka/. 



was first attempted with a system called Argus 
( [Reitman, 1965| ), using a small hand-built seman- 
tic network with a spreading activation algorithm. 
Turney et al. (120031 ) used a combination of 13 in- 
dependent modules. Veale (120041) used a spread- 
ing activation algorithm with WordNet (in effect, 
treating WordNet as a semantic network). Turney 
(120061) used a corpus-based algorithm. 

We may view Table |2] as a binary classifica- 
tion problem, in which mason:stone and carpen- 
ter: wood are positive examples and the remaining 
word pairs are negative examples. The difficulty is 
that the labels of the choice pairs must be hidden 
from the learning algorithm. That is, the training 
set consists of one positive example (the stem pair) 
and the testing set consists of five unlabeled exam- 
ples (the five choice pairs). To make this task more 
tractable, we randomly choose a stem pair from 
one of the 373 other SAT analogy questions, and 
we assume that this new stem pair is a negative ex- 
ample, as shown in Table [3] 



Word pair 


Train or test 


Class label 


mason: stone 


train 


positive 


tutor:pupil 


train 


negative 


teacher:chalk 


test 


hidden 


carpenter: wood 


test 


hidden 


soldier:gun 


test 


hidden 


photograph:camera 


test 


hidden 


book:word 


test 


hidden 


Table 3: How to fit 


a SAT analogy question into 



the framework of supervised pair classification. 

To answer the SAT question, we use PairClass to 
estimate the probability that each testing example 
is positive, and we guess the testing example with 
the highest probability. Learning from a training 
set with only one positive example and one nega- 
tive example is difficult, since the learned model 
can be highly unstable. To increase the stability, 
we repeat the learning process 10 times, using a 
different randomly chosen negative training exam- 
ple each time. For each testing word pair, the 10 
probability estimates are averaged together. This 



is a form of bagging ( |Breiman, 1996| ). 

PairClass attains an accuracy of 52.1%. For 
comparison, the ACL Wiki lists 12 previously pub- 
lished results with the 374 SAT analogy ques- 
tionsjl Only 2 of the 12 algorithms have higher 
accuracy. The best previous result is an accu- 



racy of 56.1% (Turney, 2006 1. Random guessing 
would yield an accuracy of 20%. The average 
senior high school student achieves 57% correct 



(Turney, 20061. 



3.2 TOEFL Synonyms 

Now we apply PairClass to the task of recogniz- 
ing synonyms, using a set of 80 multiple-choice 
synonym questions from the TOEFL (test of En- 
glish as a foreign language). A sample question is 
shown in Table ID The task is to select the choice 
word that is most similar in meaning to the stem 
word. 



Stem: 




levied 


Choices: 


(a) 


imposed 




(b) 


believed 




(c) 


requested 




(d) 


correlated 


Solution: 


(a) 


imposed 



Table 4: An example of a question from the 80 
TOEFL questions. 

Synonymy can be viewed as a high de- 
gree of semantic similarity. The most com- 
mon way to measure semantic similarity is 
to measure the distance between words in 
WordNet dResnik, 19951 [Jiang and Conrath, 1997 



Hirst and St-Onge, 1998[ ). Corpus-based measures 
of word similarity are also common ( |Lesk, 1969 



Landauer and Dumais, 1 997 [ [Turney, 2001 



We may view Table [H as a binary classifica- 
tion problem, in which the pair levied:imposed is a 
positive example of the class synonymous and the 
other possible pairings are negative examples, as 
shown in Tabled 



Word pair 


Class label 


levied:imposed 


positive 


levied:believed 


negative 


levied:requested 


negative 


levied:correlated 


negative 



^For more information, see SAT Analogy Questions (State 
of the art) at http://aclweb.org/aclwiki/ 



Table 5: How to fit a TOEFL question into the 
framework of supervised pair classification. 

The 80 TOEFL questions yield 320 (80 x 4) 
word pairs, 80 labeled positive and 240 labeled 
negative. We apply PairClass to the word pairs us- 
ing ten-fold cross-validation. In each random fold, 
90% of the pairs are used for training and 10% 
are used for testing. For each fold, the model that 
is learned from the training set is used to assign 



probabilities to the pairs in the testing set. With 
ten separate folds, the ten non-overlapping test- 
ing sets cover the whole dataset. Our guess for 
each TOEFL question is the choice with the high- 
est probability of being positive, when paired with 
the corresponding stem. 

PairClass attains an accuracy of 76.2%. For 
comparison, the ACL Wiki lists 15 previously pub- 
lished results with the 80 TOEFL synonym ques- 
tionsjl Of the 15 algorithms, 8 have higher ac- 
curacy and 7 have lower. The best previous re- 



sult is an accuracy of 97.5% (Tumey et al., 2003 1, 
obtained using a hybrid of four different algo- 
rithms. Random guessing would yield an ac- 
curacy of 25%. The average foreign appli- 
cant to a US university achieves 64.5% correct 



(Landauer and Dumais, 1997). 



3.3 Synonyms and Antonyms 

The task of classifying word pairs as either syn- 
onyms or antonyms readily fits into the framework 
of supervised classification of word pairs. Table [6] 
shows some examples from a set of 136 ESL (En- 
glish as a second language) practice questions that 
we collected from various ESL websites. 



Word pair 


Class label 


galling:irksome 


synonyms 


yield:bend 


synonyms 


naive: callow 


synonyms 


advise:suggest 


synonyms 


dissimilarity :resemblance 


antonyms 


commend:denounce 


antonyms 


expose:camouflage 


antonyms 


unveil: veil 


antonyms 



Table 6: Examples of synonyms and antonyms 
from 136 ESL practice questions. 



Lin et al. (120031) distinguish synonyms from 
antonyms using two patterns, "from X to Y" and 
"either X or Y". When X and Y are antonyms, 
they occasionally appear in a large corpus in one 
of these two patterns, but it is very rare for syn- 
onyms to appear in these patterns. Our approach 
is similar to Lin et al. (120031 ). but we do not rely 
on hand-coded patterns; instead, PairClass patterns 
are generated automatically. 

Using ten-fold cross-validation, PairClass at- 
tains an accuracy of 75.0%. Always guessing 
the majority class would result in an accuracy of 



65.4%. The average human score is unknown and 
there are no previous results for comparison. 

3.4 Similar, Associated, and Both 

A common criticism of corpus-based measures of 
word similarity (as opposed to lexicon-based mea- 
sures) is that they are merely detecting associa- 
tions (co-occurrences), rather than actual semantic 
similarity ( |Lund et al., 1995| ). To address this crit- 
icism, Lund et al. (119951 ) evaluated their algorithm 
for measuring word similarity with word pairs that 
were labeled similar, associated, or both. These 
labeled pairs were originally created for cogni- 
tive psychology experiments with human subjects 
( |Chiarello et al., 1990| . Table |7] shows some ex- 
amples from this collection of 144 word pairs (48 
pairs in each of the three classes). 



Word pair 



Class label 



table: bed 


similar 


music:art 


similar 


hair:fur 


similar 


house:cabin 


similar 


cradle:baby 


associated 


mug:beer 


associated 


camehhump 


associated 


cheese:mouse 


associated 


ale:beer 


both 


uncle: aunt 


both 


pepper:salt 


both 


frown: smile 


both 



Table 7: Examples of word pairs labeled similar, 
associated, or both. 



^For more information, see TOEFL Synonym Questions 
(State of the art) at http://aclweb.org/aclwiki/ 



Lund et al. (119951 ) did not measure the accuracy 
of their algorithm on this three-class classification 
problem. Instead, following standard practice in 
cognitive psychology, they showed that their al- 
gorithm's similarity scores for the 144 word pairs 
were correlated with the response times of human 
subjects in priming tests. In a typical priming test, 
a human subject reads a priming word (cradle) and 
is then asked to complete a partial word (complete 
bab as baby). The time required to perform the 
task is taken to indicate the strength of the cogni- 
tive link between the two words (cradle and baby). 

Using ten-fold cross-validation, PairClass at- 
tains an accuracy of 77.1% on the 144 word pairs. 
Since the three classes are of equal size, guessing 
the majority class and random guessing both yield 
an accuracy of 33.3%. The average human score 
is unknown and there are no previous results for 



comparison. 

4 Discussion 

The four experiments are summarized in Tables [8] 
and m For the first two experiments, where there 
are previous results, PairCIass is not the best, but 
it performs competitively. For the second two ex- 
periments, PairCIass performs significantly above 
the baselines. However, the strength of this ap- 
proach is not its performance on any one task, but 
the range of tasks it can handle. 

As far as we know, this is the first time a stan- 
dard supervised learning algorithm has been ap- 
plied to any of these four problems. The advan- 
tage of being able to cast these problems in the 
framework of standard supervised learning prob- 
lems is that we can now exploit the huge literature 
on supervised learning. Past work on these prob- 
lems has required implicitly coding our knowl- 
edge of the nature of the task into the struc- 
ture of the algorithm. For example, the struc- 
ture of the algorithm for latent semantic analysis 
(LSA) implicitly contains a theory of synonymy 
dLandauer and Dumais, 1997D . The problem with 
this approach is that it can be very difficult to work 
out how to modify the algorithm if it does not be- 
have the way we want. On the other hand, with 
a supervised learning algorithm, we can put our 
knowledge into the labeling of the feature vectors, 
instead of putting it directly into the algorithm. 
This makes it easier to guide the system to the de- 
sired behaviour. 

With our approach to the SAT analogy ques- 
tions, we are blurring the line between supervised 
and unsupervised learning, since the training set 
for a given SAT question consists of a single real 
positive example (and a single "virtual" or "simu- 
lated" negative example). In effect, a single exam- 
ple (mason: stone) becomes a sui generis; it con- 
stitutes a class of its own. It may be possible 
to apply the machinery of supervised learning to 
other problems that apparently call for unsuper- 
vised learning (for example, clustering or measur- 
ing similarity), by using this sui generis device. 

5 Related Work 

One of the first papers using supervised ma- 
chine learning to classify word pairs was Rosario 
and Hearst's (12001b paper on classifying noun- 
modifier pairs in the medical domain. For ex- 
ample, the noun-modifier expression brain biopsy 



was classified as Procedure. Rosario and Hearst 
(120011 ) constructed feature vectors for each noun- 
modifier pair using MeSH (Medical Subject Head- 
ings) and UMLS (Unified Medical Language Sys- 
tem) as lexical resources. They then trained a neu- 
ral network to distinguish 13 classes of semantic 
relations, such as Cause, Location, Measure, and 
Instrument. Nastase and Szpakowicz (2003) ex- 
plored a similar approach to classifying general- 
domain noun-modifier pairs, using WordNet and 
Roget's Thesaurus as lexical resources. 

Turney and Liftman (120051 ) used corpus-based 
features for classifying noun-modifier pairs. Their 
features were based on 128 hand-coded patterns. 
They used a nearest-neighbour learning algorithm 
to classify general-domain noun-modifier pairs 
into 30 different classes of semantic relations. Tur- 
ney (2006) later addressed the same problem using 
8000 automatically generated patterns. 

One of the tasks in SemEval 2007 was the clas- 
sification of semantic relations between nominals 



(Girju et al., 2007 ). The problem is to classify se- 
mantic relations between nouns and noun com- 
pounds in the context of a sentence. The task 
attracted 14 teams who created 15 systems, all 
of which used supervised machine learning with 
features that were lexicon-based, corpus-based, or 
both. 

PairCIass is most similar to the algorithm of Tur- 
ney (12006b . but it differs in the following ways: 

• PairCIass does not use a lexicon to find syn- 
onyms for the input word pairs. One of our 
goals in this paper is to show that a pure 
corpus-based algorithm can handle synonyms 
without a lexicon. This considerably simpli- 
fies the algorithm. 

• PairCIass uses a support vector machine 
(SVM) instead of a nearest neighbour (NN) 
learning algorithm. 

• PairCIass does not use the singular value 
decomposition (SVD) to smooth the feature 
vectors. It has been our experience that SVD 
is not necessary with SVMs. 

• PairCIass generates probability estimates, 
whereas Turney (12006b uses a cosine mea- 
sure of similarity. Probability estimates can 
be readily used in further downstream pro- 
cessing, but cosines are less useful. 

• The automatically generated patterns in Pair- 
Class are slightly more general than the pat- 
terns of Turney (l2006l) . 



Experiment 



Number of vectors 



Number of features 



Number of classes 



SAT Analogies 2,244 

TOEFL Synonyms 320 

Synonyms and Antonyms 136 

Similar, Associated, and Both 144 



(374 X 6) 
(80 X 4) 



44,880 
6,400 
2,720 
2,880 



(2, 244 X 20) 
(320 X 20) 
(136 X 20) 
(144 X 20) 



374 
2 
2 
3 



Table 8: Summary of the four tasks. See Section[3]for explanations. 



Experiment 



Accuracy Best previous Human Baseline 



Rank 



SAT Analogies 52.1% 56.1% 

TOEFL Synonyms 76.2% 97.5% 

Synonyms and Antonyms 75.0% none 

Similar, Associated, and Both 77.1% none 



57.0% 20.0% 2 higher out of 12 

64.5% 25.0% 8 higher out of 15 

unknown 65.4% none 

unknown 33.3% none 



Table 9: Summary of experimental results. See Section[3]for explanations. 



• The morphological processing in PairClass 

is more sophisticated 



(Minnen etal, 2001 



than in Tumey (12006 l l. 

However, we believe that the main contribution of 
this paper is not PairClass itself, but the extension 
of supervised word pair classification beyond the 
classification of noun-modifier pairs and seman- 
tic relations between nominals, to analogies, syn- 
onyms, antonyms, and associations. As far as we 
know, this has not been done before. 

6 Limitations and Future Work 

The main limitation of PairClass is the need for a 
large corpus. Phrases that contain a pair of words 
tend to be more rare than phrases that contain ei- 
ther of the members of the pair, thus a large cor- 
pus is needed to ensure that sufficient numbers of 
phrases are found for each input word pair. The 
size of the corpus has a cost in terms of disk space 
and processing time. In the future, as hardware im- 
proves, this will become less of an issue, but there 
may be ways to improve the algorithm, so that a 
smaller corpus is sufficient. 

Another area for future work is to apply Pair- 
Class to more tasks. WordNet includes more than 
a dozen semantic relations (e.g., synonyms, hy- 
ponyms, hypemyms, meronyms, holonyms, and 
antonyms). PairClass should be applicable to all 
of these relations. Other potential applications in- 
clude any task that involves semantic relations, 
such as word sense disambiguation, information 
retrieval, information extraction, and metaphor in- 
terpretation. 

7 Conclusion 

In this paper, we have described a uniform ap- 
proach to analogies, synonyms, antonyms, and as- 



sociations, in which all of these phenomena are 
subsumed by analogies. We view the problem of 
recognizing analogies as the classification of se- 
mantic relations between words. 

We believe that most of our lexical knowledge 
is relational, not attributional. That is, meaning is 
largely about relations among words, rather than 
properties of individual words, considered in iso- 
lation. For example, consider the knowledge en- 
coded in WordNet: much of the knowledge in 
WordNet is embedded in the graph structure that 
connects words. 

Analogies of the form A:B::C:D are called 
proportional analogies. These types of lower- 
level analogies may be contrasted with higher- 
level analogies, such as the analogy between the 
solar system and Rutherford's model of the atom 
( |Falkenhainer et al., 1989| ), which are sometimes 
called conceptual analogies. We believe that the 
difference between these two types is largely a 
matter of complexity. A higher-level analogy is 
composed of many lower-level analogies. Progress 
with algorithms for processing lower-level analo- 
gies will eventually contribute to algorithms for 
higher-level analogies. 

The idea of subsuming a broad range of se- 
mantic phenomena under analogies has been sug- 
gested by many researchers. Minsky ( I1986I I 
wrote, "How do we ever understand anything? 
Almost always, I think, by using one or an- 
other kind of analogy." Hofstadter ( 120071 ) 
claimed, "all meaning comes from analogies." In 
NLP, analogical algorithms have been applied to 
machine translation ( Lepage and Denoual, 2005| , 



morphology ( [Lepage, 1998[ ), and semantic rela- 
tions (ITumey and Littman, 20051). Analogy pro- 



vides a framework that has the potential to unify 
the field of semantics. This paper is a small step 



towards that goal. 
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